{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# ⚖️ Create a legal preference dataset\n",
    "\n",
    "_Authored by: [David Berenstein](https://huggingface.co/davidberenstein1957) and [Sara Han Díaz](https://huggingface.co/sdiazlor)_\n",
    "\n",
    "In this tutorial, you will learn how to use the Notus model on HF Inference Endpoints to create a legal preference dataset based on Retrieval Augmented Generation (RAG) instructions from the European AI Act. It is a full end-to-end example of how to use distilabel to leverage LLMs!\n",
    "\n",
    "[distilabel](https://github.com/argilla-io/distilabel) is an AI Feedback (AIF) framework that can generate and label datasets using LLMs and can be used for many different use cases. Implemented with robustness, efficiency, and scalability in mind, it allows anyone to build their own synthetic datasets for many different scenarios.\n",
    "\n",
    "To generate the instruction dataset, we will use the [HF Inference Endpoints](https://huggingface.co/docs/inference-endpoints/en/index) integrated with distilabel. These Inference Endpoints are provided by Hugging Face and allow you to easily deploy and run transformers, diffusers or any available model from the Hub on a dedicated and autoscaling infrastructure. You can find more information on how to create your first endpoint [here](https://huggingface.co/docs/inference-endpoints/guides/create_endpoint).\n",
    "\n",
    "The LLM that we will use for this is [Notus 7B](https://argilla.io/blog/notus7b/), a fine-tuned version of Zephyr 7B that uses Direct Preference Optimization (DPO) and AIF techniques to outperform its foundation model in several benchmarks and is completely open-source.\n",
    "\n",
    "This tutorial includes the following steps:\n",
    "\n",
    "- Defining a custom generating task for a `distilabel` pipeline.\n",
    "- Creating a RAG pipeline using Haystack for the EU AI Act.\n",
    "- Generating an instruction dataset with `SelfInstructTask`.\n",
    "- Generating a preference dataset using an `UltraFeedback` text quality task."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Introduction\n",
    "Let's start by installing the required dependencies to run **distilabel** and the rest of the packages used in the tutorial; most notably, **Haystack**. Also install **Argilla** for better visualization and curation of the results."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install -q -U distilabel \"farm-haystack[preprocessing]\"\n",
    "!pip install -q -U \"distilabel[hf-inference-endpoints, argilla]\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Import dependencies\n",
    "\n",
    "The main dependencies for this tutorial are distilabel, for creating the synthetic datasets, and Argilla, for visualizing and annotating these datasets and for fine-tuning our model. The [Haystack](https://haystack.deepset.ai/) package is used to split the original PDF document into the batches from which we will create our datasets.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "from typing import Dict\n",
    "\n",
    "from distilabel.llm import InferenceEndpointsLLM\n",
    "from distilabel.pipeline import Pipeline, pipeline\n",
    "from distilabel.tasks import TextGenerationTask, SelfInstructTask, Prompt\n",
    "\n",
    "import argilla as rg\n",
    "from datasets import Dataset\n",
    "from haystack.nodes import PDFToTextConverter, PreProcessor"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Environment variables\n",
    "\n",
    "We need to provide our Hugging Face access token, which can be retrieved from [Settings](https://huggingface.co/settings/tokens). In addition, we also need the OpenAI API key to generate the preference dataset through the UltraFeedback text quality task. You can find it [here](https://platform.openai.com/api-keys). Note that the fee charged depends on the model used, so make sure you check the OpenAI [pricing page](https://openai.com/pricing).\n",
    "\n",
    "To later instantiate an `InferenceEndpointsLLM` object, we need to pass the HF Inference Endpoint name and the HF namespace as parameters. A convenient way to provide them is also through environment variables.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "os.environ[\"HF_TOKEN\"] = \"\"\n",
    "os.environ[\"HF_INFERENCE_ENDPOINT_NAME\"] = \"aws-notus-7b-v1-3184\"\n",
    "os.environ[\"HF_NAMESPACE\"] = \"argilla\"\n",
    "os.environ[\"OPENAI_API_KEY\"] = \"\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setting up an inference endpoint with Notus\n",
    "\n",
    "Inference endpoints are a solution, managed by Hugging Face, to easily deploy any Transformer-like model. They are built from models on the Hugging Face Hub. Inference endpoints are handy for running inference with LLMs without the hassle of running the models locally. In this tutorial, we will use inference endpoints to generate text using our Notus model, as part of the `distilabel` workflow. The endpoint of choice has a [Notus 7B instance](https://ui.endpoints.huggingface.co/argilla/endpoints/aws-notus-7b-v1-4052) running.\n",
    "\n",
    "### Defining a custom generating task for a distilabel pipeline\n",
    "\n",
    "To kickstart this tutorial, let's see how to set up an endpoint for our Notus model. It's not part of the end-to-end example we'll see later, but an example of how to connect to a Hugging Face endpoint and a test of the `distilabel` pipeline.\n",
    "\n",
    "Let's dive into this quick example of how to use an inference endpoint. We have prepared a simple `TextGenerationTask` to ask the model questions, much as we would talk to an LLM through a chatbot. First, we define a class for the question-answering task, with methods that tell `distilabel` how to generate the prompt, which input and output arguments to expect, and how to parse the output."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "class QuestionAnsweringTask(TextGenerationTask):\n",
    "    def generate_prompt(self, question: str) -> str:\n",
    "        return Prompt(\n",
    "            system_prompt=self.system_prompt,\n",
    "            formatted_prompt=question,\n",
    "        ).format_as(\n",
    "            \"llama2\"\n",
    "        )  # type: ignore\n",
    "\n",
    "    def parse_output(self, output: str) -> Dict[str, str]:\n",
    "        return {\"answer\": output.strip()}\n",
    "\n",
    "    @property\n",
    "    def input_args_names(self) -> list[str]:\n",
    "        return [\"question\"]\n",
    "\n",
    "    @property\n",
    "    def output_args_names(self) -> list[str]:\n",
    "        return [\"answer\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`llm` is an object of the `InferenceEndpointsLLM` class. With it, we can start generating answers to questions through the `llm.generate()` method.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "llm = InferenceEndpointsLLM(\n",
    "    endpoint_name_or_model_id=os.getenv(\"HF_INFERENCE_ENDPOINT_NAME\"),  # type: ignore\n",
    "    endpoint_namespace=os.getenv(\"HF_NAMESPACE\"),  # type: ignore\n",
    "    token=os.getenv(\"HF_TOKEN\") or None,\n",
    "    task=QuestionAnsweringTask(),\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With the `InferenceEndpointsLLM` object defined with the endpoint information and the Task, we can go ahead and start generating text. Let's ask this LLM what's, for example, the second most populated city in Denmark. The answer should be Aarhus.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'The second most populated city in Denmark is Aarhus, with a population of around 340,000 people. It is located on the east coast of Jutland, and is known for its vibrant cultural scene, beautiful beaches, and historic landmarks. Aarhus is also home to Aarhus University, one of the largest universities in Scandinavia.'"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "generation = llm.generate(\n",
    "    [{\"question\": \"What's the second most populated city in Denmark?\"}]\n",
    ")\n",
    "generation[0][0][\"parsed_output\"][\"answer\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The endpoint is working correctly! We have successfully set up a custom generating task for a `distilabel` pipeline.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Creating a RAG pipeline using Haystack for the European AI Act\n",
    "\n",
    "For this end-to-end example, we would like to create an expert model capable of answering questions and filling in information about the new AI Act promoted by the European Union, the world's first regulation on artificial intelligence. As part of its digital strategy, the EU wants to regulate AI to ensure better conditions for the development and use of this innovative technology. The act is a regulatory framework for AI in which different risk levels imply more or less regulation.\n",
    "\n",
    "The RAG pipeline that we want to create downloads the PDF file, converts it to plain text and preprocesses it, creating batches that we can feed to `distilabel` to start creating instructions from them. Let's see this first part of the pipeline and get the input data. Note that this RAG part is not an active pipeline driven by queries or semantic properties, but a more brute-force approach in which we download the PDF and preprocess its contents.\n",
    "\n",
    "### Downloading the AI Act PDF\n",
    "\n",
    "Firstly, we need to download the PDF document itself. We'll place it in our working directory, if it's not there already.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "\n",
    "if [ ! -f \"The-AI-Act.pdf\" ]; then\n",
    "    wget -q https://artificialintelligenceact.eu/wp-content/uploads/2021/08/The-AI-Act.pdf\n",
    "fi"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once we have it in our working directory, we can use Haystack's Converter and Pipeline features to extract the textual data, clean it and divide it into batches. Afterwards, these batches will be used to start creating synthetic instructions.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# The converter turns the PDF into text we can process easily\n",
    "converter = PDFToTextConverter(remove_numeric_tables=True, valid_languages=[\"en\"])\n",
    "\n",
    "# Preprocessing pipelines can have several steps.\n",
    "# Ours cleans empty lines, headers, footers and whitespace\n",
    "# and splits the text into batches of up to 150 words, respecting\n",
    "# where the sentences naturally end and begin.\n",
    "preprocessor = PreProcessor(\n",
    "    clean_empty_lines=True,\n",
    "    clean_whitespace=True,\n",
    "    clean_header_footer=True,\n",
    "    split_by=\"word\",\n",
    "    split_length=150,\n",
    "    split_respect_sentence_boundary=True,\n",
    ")\n",
    "\n",
    "doc = converter.convert(file_path=\"The-AI-Act.pdf\", meta=None)[0]\n",
    "docs = preprocessor.process([doc])\n",
    "print(f\"Documents: 1\\nBatches: {len(docs)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's take a quick look at the batches we just generated.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'EN EN\\nEUROPEAN\\nCOMMISSION\\nProposal for a\\nREGULATION OF THE EUROPEAN PARLIAMENT AND OF THE COUNCIL\\nLAYING DOWN HARMONISED RULES ON ARTIFICIAL INTELLIGENCE\\n(ARTIFICIAL INTELLIGENCE ACT) AND AMENDING CERTAIN UNION\\nLEGISLATIVE ACTS\\x0cEN\\nEXPLANATORY MEMORANDUM\\n1. CONTEXT OF THE PROPOSAL\\n1.1. Reasons for and objectives of the proposal\\nThis explanatory memorandum accompanies the proposal for a Regulation laying down\\nharmonised rules on artificial intelligence (Artificial Intelligence Act). Artificial Int'"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "inputs = [doc.content for doc in docs]\n",
    "inputs[0][0:500]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The document has been correctly batched, from one big document to 355 strings of roughly 150 words at most (`split_length` counts words because `split_by=\"word\"`). This list of strings can now be used as input to generate an instruction dataset using `distilabel`.\n"
   ]
  },
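  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the splitting step more concrete, here is a minimal plain-Python sketch of word-based chunking. It is only an illustration under simplifying assumptions, not Haystack's implementation: `split_by_words` is a hypothetical helper, and it ignores sentence boundaries, which `PreProcessor` takes into account when `split_respect_sentence_boundary=True`.\n",
    "\n",
    "```python\n",
    "def split_by_words(text, split_length=150):\n",
    "    # Hypothetical helper: group whitespace-separated words into\n",
    "    # chunks of at most `split_length` words each.\n",
    "    words = text.split()\n",
    "    return [\n",
    "        ' '.join(words[i:i + split_length])\n",
    "        for i in range(0, len(words), split_length)\n",
    "    ]\n",
    "\n",
    "\n",
    "# Made-up sample text of 400 words, just to exercise the helper.\n",
    "sample = ' '.join(f'word{i}' for i in range(400))\n",
    "chunks = split_by_words(sample)\n",
    "print([len(chunk.split()) for chunk in chunks])\n",
    "```\n",
    "\n",
    "The point of the sketch is that the split length is counted in words rather than characters, so a single batch can be several hundred characters long."
   ]
  },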
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Generating instructions with SelfInstructTask\n",
    "\n",
    "With our Inference Endpoint up and running, we should be able to generate instructions with distilabel. These instructions, made by the LLM through our endpoint, will form an instruction dataset, with instructions created from the data we just extracted.\n",
    "\n",
    "For this example, we are using a subset of 50 batches generated in the section above, to be gentle on performance.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Dataset({\n",
       "    features: ['input'],\n",
       "    num_rows: 50\n",
       "})"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "instructions_dataset = Dataset.from_dict({\"input\": inputs[0:50]})\n",
    "\n",
    "instructions_dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With the `SelfInstructTask` class we can generate a Self-Instruct specification for building the prompts, as done in the [Self-Instruct paper](https://arxiv.org/abs/2212.10560). `distilabel` will start from human-made input, in this case the batches we created from the AI Act PDF, and it will generate instructions based on it. These instructions can then be reviewed using Argilla to keep the best ones.\n",
    "\n",
    "An application description can be passed as a parameter to specify the behaviour of the model; we want a model capable of answering our questions about the AI Act.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "instructions_task = SelfInstructTask(\n",
    "    application_description=\"A assistant that can answer questions about the AI Act made by the European Union.\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's now define a generator, passing the `SelfInstructTask` object, and create a `Pipeline` object.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "instructions_generator = InferenceEndpointsLLM(\n",
    "    endpoint_name_or_model_id=os.getenv(\"HF_INFERENCE_ENDPOINT_NAME\"),  # type: ignore\n",
    "    endpoint_namespace=os.getenv(\"HF_NAMESPACE\"),  # type: ignore\n",
    "    token=os.getenv(\"HF_TOKEN\") or None,\n",
    "    task=instructions_task,\n",
    ")\n",
    "\n",
    "instructions_pipeline = Pipeline(generator=instructions_generator)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Our pipeline is ready to be used to generate instructions. Let's do it!\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "generated_instructions = instructions_pipeline.generate(\n",
    "    dataset=instructions_dataset, num_generations=1, batch_size=8\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The pipeline has successfully generated instructions given the input batches and the application description. Let's gather all those instructions and see how they look.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of generated instructions: 178\n",
      "What are the reasons for and objectives of the proposal for a Regulation laying down harmonised rules on artificial intelligence?\n",
      "How can artificial intelligence improve prediction, optimise operations and resource allocation, and personalise service delivery?\n",
      "What benefits can artificial intelligence bring to the European economy and society as a whole?\n",
      "How can the use of artificial intelligence support socially and environmentally beneficial outcomes?\n",
      "What are the high-impact sectors that require AI action according to the AI Act by the European Union?\n"
     ]
    }
   ],
   "source": [
    "instructions = []\n",
    "for generations in generated_instructions[\"instructions\"]:\n",
    "    for generation in generations:\n",
    "        instructions.extend(generation)\n",
    "\n",
    "print(f\"Number of generated instructions: {len(instructions)}\")\n",
    "\n",
    "for instruction in instructions[:5]:\n",
    "    print(instruction)"
   ]
  },
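  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The nested loop above flattens a column that holds, for each input row, one list of instructions per generation. As a minimal sketch of that structure, the snippet below flattens a small made-up stand-in for `generated_instructions['instructions']` with `itertools.chain`. The sample questions are placeholders, and the sketch assumes every generation parsed into a list.\n",
    "\n",
    "```python\n",
    "from itertools import chain\n",
    "\n",
    "# Made-up stand-in for the instructions column:\n",
    "# one entry per input row, one inner list per generation.\n",
    "instructions_column = [\n",
    "    [['Question A?', 'Question B?']],\n",
    "    [['Question C?']],\n",
    "    [[]],\n",
    "]\n",
    "\n",
    "# Walk rows, then generations, and concatenate the inner lists.\n",
    "flat = list(chain.from_iterable(\n",
    "    generation\n",
    "    for generations in instructions_column\n",
    "    for generation in generations\n",
    "))\n",
    "print(flat)\n",
    "```\n",
    "\n",
    "Rows that produced no instructions simply contribute nothing to the flat list, which is one reason the total count need not be a fixed multiple of the number of inputs."
   ]
  },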
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "These initial instructions make up our instruction dataset. Following the human-in-the-loop approach, we should push the instructions to Argilla to visualize them and rank them in terms of quality. Those annotations are essential for producing quality data and, in turn, a better-performing final model. Nevertheless, this step is optional.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Pushing the instruction dataset to Argilla to visualize and annotate\n",
    "\n",
    "Let's take a quick look at the instructions generated by `SelfInstructTask`.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
     {
      "data": {
       "text/plain": [
        "{'input': 'EN EN\\nEUROPEAN\\nCOMMISSION\\nProposal for a\\nREGULATION OF THE EUROPEAN PARLIAMENT AND OF THE COUNCIL\\nLAYING DOWN HARMONISED RULES ON ARTIFICIAL INTELLIGENCE\\n(ARTIFICIAL INTELLIGENCE ACT) AND AMENDING CERTAIN UNION\\nLEGISLATIVE ACTS\\x0cEN\\nEXPLANATORY MEMORANDUM\\n1. CONTEXT OF THE PROPOSAL\\n1.1. Reasons for and objectives of the proposal\\nThis explanatory memorandum accompanies the proposal for a Regulation laying down\\nharmonised rules on artificial intelligence (Artificial Intelligence Act). Artificial Intelligence\\n(AI) is a fast evolving family of technologies that can bring a wide array of economic and\\nsocietal benefits across the entire spectrum of industries and social activities. By improving\\nprediction, optimising operations and resource allocation, and personalising service delivery,\\nthe use of artificial intelligence can support socially and environmentally beneficial outcomes\\nand provide key competitive advantages to companies and the European economy. ',\n",
        " 'generation_model': ['argilla/notus-7b-v1'],\n",
        " 'generation_prompt': ['You are an expert prompt writer, writing the best and most diverse prompts for a variety of tasks. You are given a task description and a set of instructions for how to write the prompts for an specific AI application.\\n# Task Description\\nDevelop 5 user queries that can be received by the given AI application and applicable to the provided context. Emphasize diversity in verbs and linguistic structures within the model\\'s textual capabilities.\\n\\n# Criteria for Queries\\nIncorporate a diverse range of verbs, avoiding repetition.\\nEnsure queries are compatible with AI model\\'s text generation functions and are limited to 1-2 sentences.\\nDesign queries to be self-contained and standalone.\\nBlend interrogative (e.g., \"What is the significance of x?\") and imperative (e.g., \"Detail the process of x.\") styles.\\nWrite each query on a separate line and avoid using numbered lists or bullet points.\\n\\n# AI Application\\nA assistant that can answer questions about the AI Act made by the European Union.\\n\\n# Context\\nEN EN\\nEUROPEAN\\nCOMMISSION\\nProposal for a\\nREGULATION OF THE EUROPEAN PARLIAMENT AND OF THE COUNCIL\\nLAYING DOWN HARMONISED RULES ON ARTIFICIAL INTELLIGENCE\\n(ARTIFICIAL INTELLIGENCE ACT) AND AMENDING CERTAIN UNION\\nLEGISLATIVE ACTS\\x0cEN\\nEXPLANATORY MEMORANDUM\\n1. CONTEXT OF THE PROPOSAL\\n1.1. Reasons for and objectives of the proposal\\nThis explanatory memorandum accompanies the proposal for a Regulation laying down\\nharmonised rules on artificial intelligence (Artificial Intelligence Act). Artificial Intelligence\\n(AI) is a fast evolving family of technologies that can bring a wide array of economic and\\nsocietal benefits across the entire spectrum of industries and social activities. By improving\\nprediction, optimising operations and resource allocation, and personalising service delivery,\\nthe use of artificial intelligence can support socially and environmentally beneficial outcomes\\nand provide key competitive advantages to companies and the European economy. \\n\\n# Output\\n'],\n",
       " 'raw_generation_responses': ['1. What are the reasons for and objectives of the proposal for a Regulation laying down harmonised rules on artificial intelligence?\\n2. How can artificial intelligence improve prediction, optimise operations and resource allocation, and personalise service delivery?\\n3. What benefits can artificial intelligence bring to the European economy and society as a whole?\\n4. How can the use of artificial intelligence support socially and environmentally beneficial outcomes?\\n5. What competitive advantages can companies gain from using artificial intelligence?'],\n",
       " 'instructions': [['What are the reasons for and objectives of the proposal for a Regulation laying down harmonised rules on artificial intelligence?',\n",
       "   'How can artificial intelligence improve prediction, optimise operations and resource allocation, and personalise service delivery?',\n",
       "   'What benefits can artificial intelligence bring to the European economy and society as a whole?',\n",
       "   'How can the use of artificial intelligence support socially and environmentally beneficial outcomes?']]}"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "generated_instructions[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For each input, i.e., each batch of the AI Act PDF, we have a generation prompt with general guidelines on how to behave, as well as the application description parameter. For this input, 4 instructions have been generated.\n",
    "\n",
    "Now it's the perfect time to upload the instruction dataset to Argilla, review it and manually annotate it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "FeedbackRecord(fields={'input': 'EN EN\\nEUROPEAN\\nCOMMISSION\\nProposal for a\\nREGULATION OF THE EUROPEAN PARLIAMENT AND OF THE COUNCIL\\nLAYING DOWN HARMONISED RULES ON ARTIFICIAL INTELLIGENCE\\n(ARTIFICIAL INTELLIGENCE ACT) AND AMENDING CERTAIN UNION\\nLEGISLATIVE ACTS\\x0cEN\\nEXPLANATORY MEMORANDUM\\n1. CONTEXT OF THE PROPOSAL\\n1.1. Reasons for and objectives of the proposal\\nThis explanatory memorandum accompanies the proposal for a Regulation laying down\\nharmonised rules on artificial intelligence (Artificial Intelligence Act). Artificial Intelligence\\n(AI) is a fast evolving family of technologies that can bring a wide array of economic and\\nsocietal benefits across the entire spectrum of industries and social activities. By improving\\nprediction, optimising operations and resource allocation, and personalising service delivery,\\nthe use of artificial intelligence can support socially and environmentally beneficial outcomes\\nand provide key competitive advantages to companies and the European economy.', 'instruction': 'What are the reasons for and objectives of the proposal for a Regulation laying down harmonised rules on artificial intelligence?'}, metadata={'length-input': 964, 'length-instruction': 129, 'generation-model': 'argilla/notus-7b-v1'}, vectors={}, responses=[], suggestions=(), external_id=None)"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "instructions_rg_dataset = generated_instructions.to_argilla()\n",
    "instructions_rg_dataset[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "instructions_rg_dataset.push_to_argilla(name=\"notus_AI_instructions\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the Argilla UI, each input-instruction pair is shown as an individual record and can be annotated individually.\n",
    "\n",
    "![Instruction dataset](https://huggingface.co/datasets/huggingface/cookbook-images/resolve/main/instrucion_dataset_notus_ui.png)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Generating a preference dataset using an UltraFeedback text quality task\n",
    "\n",
    "Once we have our instruction dataset, we are going to create a preference dataset through the UltraFeedback text quality task. This type of NLP task evaluates the quality of generated text; our goal is to provide detailed feedback on that quality, beyond a binary label.\n",
    "\n",
    "The `pipeline()` function allows us to create a `Pipeline` instance with the provided LLMs for a given task, which is useful whenever you want to use a pre-defined or custom `Pipeline` for that task. We will specify our task and subtask, the generator we want to use (in this case, one based on a `TextGenerationTask`) and our OpenAI API key.\n",
    "\n",
    "> Note that it is also possible to retrieve this feedback without using an OpenAI model. However, the quality of the feedback will likely be lower."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "preference_pipeline = pipeline(\n",
    "    \"preference\",\n",
    "    \"instruction-following\",\n",
    "    generator=InferenceEndpointsLLM(\n",
    "        endpoint_name_or_model_id=os.getenv(\"HF_INFERENCE_ENDPOINT_NAME\"),  # type: ignore\n",
    "        endpoint_namespace=os.getenv(\"HF_NAMESPACE\", None),\n",
    "        task=TextGenerationTask(),\n",
    "        max_new_tokens=256,\n",
    "        num_threads=2,\n",
    "        temperature=0.3,\n",
    "    ),\n",
    "    max_new_tokens=256,\n",
    "    num_threads=2,\n",
    "    api_key=os.getenv(\"OPENAI_API_KEY\", None),\n",
    "    temperature=0.0,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We also need to retrieve our instruction dataset from Argilla, as it will be the input of this pipeline.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Dataset({\n",
       "    features: ['input', 'instruction', 'instruction-rating', 'instruction-rating-suggestion', 'instruction-rating-suggestion-metadata', 'external_id', 'metadata'],\n",
       "    num_rows: 100\n",
       "})"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "remote_dataset = rg.FeedbackDataset.from_argilla(\n",
    "    \"notus_AI_instructions\", workspace=\"admin\"\n",
    ")\n",
    "instructions_dataset = remote_dataset.pull(max_records=100)  # get first 100 records\n",
    "\n",
    "instructions_dataset = instructions_dataset.format_as(\"datasets\")\n",
    "instructions_dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'input': 'EN EN\\nEUROPEAN\\nCOMMISSION\\nProposal for a\\nREGULATION OF THE EUROPEAN PARLIAMENT AND OF THE COUNCIL\\nLAYING DOWN HARMONISED RULES ON ARTIFICIAL INTELLIGENCE\\n(ARTIFICIAL INTELLIGENCE ACT) AND AMENDING CERTAIN UNION\\nLEGISLATIVE ACTS\\x0cEN\\nEXPLANATORY MEMORANDUM\\n1. CONTEXT OF THE PROPOSAL\\n1.1. Reasons for and objectives of the proposal\\nThis explanatory memorandum accompanies the proposal for a Regulation laying down\\nharmonised rules on artificial intelligence (Artificial Intelligence Act). Artificial Intelligence\\n(AI) is a fast evolving family of technologies that can bring a wide array of economic and\\nsocietal benefits across the entire spectrum of industries and social activities. By improving\\nprediction, optimising operations and resource allocation, and personalising service delivery,\\nthe use of artificial intelligence can support socially and environmentally beneficial outcomes\\nand provide key competitive advantages to companies and the European economy.',\n",
       " 'instruction': 'What are the reasons for and objectives of the proposal for a Regulation laying down harmonised rules on artificial intelligence?',\n",
       " 'instruction-rating': [],\n",
       " 'instruction-rating-suggestion': None,\n",
       " 'instruction-rating-suggestion-metadata': {'type': None,\n",
       "  'score': None,\n",
       "  'agent': None},\n",
       " 'external_id': None,\n",
       " 'metadata': '{\"length-input\": 964, \"length-instruction\": 129, \"generation-model\": \"argilla/notus-7b-v1\"}'}"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "instructions_dataset[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before generating text based on our instructions, we need to rename some of the columns in our dataset. From the previous section, the `input` column still holds the batches from the PDF. We rename it to `context` and rename the `instruction` column to `input`, so that the generated instructions become the input of the pipeline."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "instructions_dataset = instructions_dataset.rename_columns({\"input\": \"context\", \"instruction\": \"input\"})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, let's build the preference dataset by running the pipeline we just created on our instructions.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "preference_dataset = preference_pipeline.generate(\n",
    "    instructions_dataset,  # type: ignore\n",
    "    num_generations=2,\n",
    "    batch_size=8,\n",
    "    display_progress_bar=True,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's take a look at an instance of the preference dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'context': 'EN EN\\nEUROPEAN\\nCOMMISSION\\nProposal for a\\nREGULATION OF THE EUROPEAN PARLIAMENT AND OF THE COUNCIL\\nLAYING DOWN HARMONISED RULES ON ARTIFICIAL INTELLIGENCE\\n(ARTIFICIAL INTELLIGENCE ACT) AND AMENDING CERTAIN UNION\\nLEGISLATIVE ACTS\\x0cEN\\nEXPLANATORY MEMORANDUM\\n1. CONTEXT OF THE PROPOSAL\\n1.1. Reasons for and objectives of the proposal\\nThis explanatory memorandum accompanies the proposal for a Regulation laying down\\nharmonised rules on artificial intelligence (Artificial Intelligence Act). Artificial Intelligence\\n(AI) is a fast evolving family of technologies that can bring a wide array of economic and\\nsocietal benefits across the entire spectrum of industries and social activities. By improving\\nprediction, optimising operations and resource allocation, and personalising service delivery,\\nthe use of artificial intelligence can support socially and environmentally beneficial outcomes\\nand provide key competitive advantages to companies and the European economy.',\n",
       " 'input': 'What are the reasons for and objectives of the proposal for a Regulation laying down harmonised rules on artificial intelligence?',\n",
       " 'instruction-rating': [],\n",
       " 'instruction-rating-suggestion': None,\n",
       " 'instruction-rating-suggestion-metadata': {'agent': None,\n",
       "  'score': None,\n",
       "  'type': None},\n",
       " 'external_id': None,\n",
       " 'metadata': '{\"length-input\": 964, \"length-instruction\": 129, \"generation-model\": \"argilla/notus-7b-v1\"}',\n",
       " 'generation_model': ['argilla/notus-7b-v1', 'argilla/notus-7b-v1'],\n",
       " 'generation_prompt': [\"You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.\\nWhat are the reasons for and objectives of the proposal for a Regulation laying down harmonised rules on artificial intelligence?\",\n",
       "  \"You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.\\nWhat are the reasons for and objectives of the proposal for a Regulation laying down harmonised rules on artificial intelligence?\"],\n",
       " 'raw_generation_responses': [\"\\nThe proposal for a Regulation laying down harmonised rules on artificial intelligence (AI) aims to ensure the trustworthy use of AI in the EU. It seeks to create a single market for AI applications and services, while ensuring that they are safe and respect fundamental rights. The proposal is part of the EU's broader strategy on AI, which aims to put the EU at the forefront of global AI development and deployment.\\nThe objectives of the proposal are to:\\n\\n1. Ensure that AI systems are designed, developed, and deployed in a way that respects fundamental rights and values, including human dignity, freedom, and privacy.\\n2. Ensure that AI systems are safe and secure, and do not pose unacceptable risks to people, property, or the environment.\\n3. Ensure that AI systems are robust, reliable, and accurate, and can be trusted to deliver the intended functionality.\\n4. Ensure that AI systems are traceable, meaning that it is possible to track how they work and how they make decisions.\\n5. Ensure that AI systems are transparent, meaning that it is possible to understand how they work and how they make decisions.\\n6. Ensure that AI systems are fair, meaning that they do not discriminate against individuals\",\n",
       "  '\\nThe proposal for a Regulation laying down harmonised rules on artificial intelligence (AI) aims to ensure a high level of safety and security of AI systems and to establish a horizontal and technology-neutral framework for AI applications. This will help to create a single market for AI and to ensure that AI systems are developed and deployed in a responsible manner. The proposal will also help to strengthen the competitiveness of the EU industry in the global AI market.\\nThe objectives of the proposal are:\\n1. To ensure that AI systems are safe and secure by establishing a risk-based framework for the development, placement on the market and use of AI systems.\\n2. To establish a horizontal and technology-neutral framework for AI applications that is applicable to all sectors and types of AI systems.\\n3. To ensure that AI systems are developed and deployed in a responsible manner by establishing requirements for transparency, robustness, security, accuracy, controllability and privacy protection.\\n4. To create a single market for AI by ensuring that AI systems are developed and deployed in a harmonised manner across the EU.\\n5. To strengthen the competitiveness of the EU industry in the global AI market by creating a level playing field for businesses and by promoting the'],\n",
       " 'generations': [\"\\nThe proposal for a Regulation laying down harmonised rules on artificial intelligence (AI) aims to ensure the trustworthy use of AI in the EU. It seeks to create a single market for AI applications and services, while ensuring that they are safe and respect fundamental rights. The proposal is part of the EU's broader strategy on AI, which aims to put the EU at the forefront of global AI development and deployment.\\nThe objectives of the proposal are to:\\n\\n1. Ensure that AI systems are designed, developed, and deployed in a way that respects fundamental rights and values, including human dignity, freedom, and privacy.\\n2. Ensure that AI systems are safe and secure, and do not pose unacceptable risks to people, property, or the environment.\\n3. Ensure that AI systems are robust, reliable, and accurate, and can be trusted to deliver the intended functionality.\\n4. Ensure that AI systems are traceable, meaning that it is possible to track how they work and how they make decisions.\\n5. Ensure that AI systems are transparent, meaning that it is possible to understand how they work and how they make decisions.\\n6. Ensure that AI systems are fair, meaning that they do not discriminate against individuals\",\n",
       "  '\\nThe proposal for a Regulation laying down harmonised rules on artificial intelligence (AI) aims to ensure a high level of safety and security of AI systems and to establish a horizontal and technology-neutral framework for AI applications. This will help to create a single market for AI and to ensure that AI systems are developed and deployed in a responsible manner. The proposal will also help to strengthen the competitiveness of the EU industry in the global AI market.\\nThe objectives of the proposal are:\\n1. To ensure that AI systems are safe and secure by establishing a risk-based framework for the development, placement on the market and use of AI systems.\\n2. To establish a horizontal and technology-neutral framework for AI applications that is applicable to all sectors and types of AI systems.\\n3. To ensure that AI systems are developed and deployed in a responsible manner by establishing requirements for transparency, robustness, security, accuracy, controllability and privacy protection.\\n4. To create a single market for AI by ensuring that AI systems are developed and deployed in a harmonised manner across the EU.\\n5. To strengthen the competitiveness of the EU industry in the global AI market by creating a level playing field for businesses and by promoting the'],\n",
       " 'labelling_model': 'gpt-3.5-turbo',\n",
       " 'labelling_prompt': [{'content': 'Your role is to evaluate text quality based on given criteria.',\n",
       "   'role': 'system'},\n",
       "  {'content': \"\\n# Instruction Following Assessment\\nEvaluate alignment between output and intent. Assess understanding of task goal and restrictions.\\n**Instruction Components**: Task Goal (intended outcome), Restrictions (text styles, formats, or designated methods, etc).\\n\\n**Scoring**: Rate outputs 1 to 5:\\n\\n1. **Irrelevant**: No alignment.\\n2. **Partial Focus**: Addresses one aspect poorly.\\n3. **Partial Compliance**:\\n\\t- (1) Meets goal or restrictions, neglecting other.\\n\\t- (2) Acknowledges both but slight deviations.\\n4. **Almost There**: Near alignment, minor deviations.\\n5. **Comprehensive Compliance**: Fully aligns, meets all requirements.\\n\\n---\\n\\n## Format\\n\\n### Input\\nInstruction: [Specify task goal and restrictions]\\n\\nTexts:\\n\\n<text 1> [Text 1]\\n<text 2> [Text 2]\\n\\n### Output\\n\\n#### Output for Text 1\\nRating: [Rating for text 1]\\nRationale: [Rationale for the rating in short sentences]\\n\\n#### Output for Text 2\\nRating: [Rating for text 2]\\nRationale: [Rationale for the rating in short sentences]\\n\\n---\\n\\n## Annotation\\n\\n### Input\\nInstruction: What are the reasons for and objectives of the proposal for a Regulation laying down harmonised rules on artificial intelligence?\\n\\nTexts:\\n\\n<text 1> \\nThe proposal for a Regulation laying down harmonised rules on artificial intelligence (AI) aims to ensure the trustworthy use of AI in the EU. It seeks to create a single market for AI applications and services, while ensuring that they are safe and respect fundamental rights. The proposal is part of the EU's broader strategy on AI, which aims to put the EU at the forefront of global AI development and deployment.\\nThe objectives of the proposal are to:\\n\\n1. Ensure that AI systems are designed, developed, and deployed in a way that respects fundamental rights and values, including human dignity, freedom, and privacy.\\n2. 
Ensure that AI systems are safe and secure, and do not pose unacceptable risks to people, property, or the environment.\\n3. Ensure that AI systems are robust, reliable, and accurate, and can be trusted to deliver the intended functionality.\\n4. Ensure that AI systems are traceable, meaning that it is possible to track how they work and how they make decisions.\\n5. Ensure that AI systems are transparent, meaning that it is possible to understand how they work and how they make decisions.\\n6. Ensure that AI systems are fair, meaning that they do not discriminate against individuals\\n<text 2> \\nThe proposal for a Regulation laying down harmonised rules on artificial intelligence (AI) aims to ensure a high level of safety and security of AI systems and to establish a horizontal and technology-neutral framework for AI applications. This will help to create a single market for AI and to ensure that AI systems are developed and deployed in a responsible manner. The proposal will also help to strengthen the competitiveness of the EU industry in the global AI market.\\nThe objectives of the proposal are:\\n1. To ensure that AI systems are safe and secure by establishing a risk-based framework for the development, placement on the market and use of AI systems.\\n2. To establish a horizontal and technology-neutral framework for AI applications that is applicable to all sectors and types of AI systems.\\n3. To ensure that AI systems are developed and deployed in a responsible manner by establishing requirements for transparency, robustness, security, accuracy, controllability and privacy protection.\\n4. To create a single market for AI by ensuring that AI systems are developed and deployed in a harmonised manner across the EU.\\n5. To strengthen the competitiveness of the EU industry in the global AI market by creating a level playing field for businesses and by promoting the\\n\\n### Output \",\n",
       "   'role': 'user'}],\n",
       " 'raw_labelling_response': '#### Output for Text 1\\nRating: 5\\nRationale: The text fully aligns with the task goal and restrictions. It clearly states the reasons for and objectives of the proposal for a Regulation laying down harmonised rules on artificial intelligence, including ensuring the trustworthy use of AI, creating a single market for AI applications and services, and ensuring safety, respect for fundamental rights, robustness, transparency, and fairness of AI systems.\\n\\n#### Output for Text 2\\nRating: 4\\nRationale: The text mostly aligns with the task goal and restrictions. It addresses the reasons for and objectives of the proposal for a Regulation laying down harmonised rules on artificial intelligence, including ensuring safety and security of AI systems, establishing a horizontal and technology-neutral framework, promoting responsible development and deployment of AI systems, creating a single market for AI, and strengthening the competitiveness of the EU industry in the global AI market. However, it does not explicitly mention the need to respect fundamental rights, accuracy of AI systems, and traceability of AI systems, which are mentioned in the task goal and restrictions.',\n",
       " 'rating': [5.0, 4.0],\n",
       " 'rationale': ['The text fully aligns with the task goal and restrictions. It clearly states the reasons for and objectives of the proposal for a Regulation laying down harmonised rules on artificial intelligence, including ensuring the trustworthy use of AI, creating a single market for AI applications and services, and ensuring safety, respect for fundamental rights, robustness, transparency, and fairness of AI systems.',\n",
       "  'The text mostly aligns with the task goal and restrictions. It addresses the reasons for and objectives of the proposal for a Regulation laying down harmonised rules on artificial intelligence, including ensuring safety and security of AI systems, establishing a horizontal and technology-neutral framework, promoting responsible development and deployment of AI systems, creating a single market for AI, and strengthening the competitiveness of the EU industry in the global AI market. However, it does not explicitly mention the need to respect fundamental rights, accuracy of AI systems, and traceability of AI systems, which are mentioned in the task goal and restrictions.']}"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "preference_dataset[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Human Feedback with Argilla\n",
    "\n",
    "You can use the AI Feedback created by distilabel directly, but we have seen that enhancing it with human feedback improves the quality of your LLM. We provide a `to_argilla` method that creates an Argilla dataset with out-of-the-box tailored metadata filters and semantic search, so you can provide human feedback as quickly and engagingly as possible. You can check [the Argilla docs](https://docs.argilla.io/en/latest/getting_started/quickstart_installation.html) to get it up and running."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you are running Argilla using the Docker quickstart image or Hugging Face Spaces, you need to initialize the Argilla client with the `api_url` and `api_key`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import argilla as rg\n",
    "\n",
    "# Replace api_url with your HF Spaces URL if using Spaces\n",
    "# Replace api_key if you configured a custom API key\n",
    "rg.init(\n",
    "    api_url=\"http://localhost:6900\",\n",
    "    api_key=\"owner.apikey\",\n",
    "    workspace=\"admin\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once our preference dataset has been generated, the Argilla UI is the best tool at our disposal to visualize and annotate it. As with the instruction dataset, we just have to convert it to an Argilla Feedback Dataset and push it to Argilla."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Convert the preference dataset to an Argilla Feedback Dataset\n",
    "preference_rg_dataset = preference_dataset.to_argilla()\n",
    "\n",
    "# Add the context as a metadata property in the new Feedback Dataset, as this\n",
    "# information will be useful later.\n",
    "for record_feedback, record_huggingface in zip(\n",
    "    preference_rg_dataset, preference_dataset\n",
    "):\n",
    "    record_feedback.metadata[\"context\"] = record_huggingface[\"context\"]\n",
    "\n",
    "# Upload the dataset to the running Argilla instance\n",
    "preference_rg_dataset.push_to_argilla(name=\"notus_AI_preference\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the Argilla UI, we can see the input (an instruction) and the two generations that the LLM produced from it.\n",
    "\n",
    "![Preference dataset](https://huggingface.co/datasets/huggingface/cookbook-images/resolve/main/preference_dataset_notus_ui.png)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Conclusions\n",
    "\n",
    "To conclude, we have gone through an end-to-end example of distilabel. We set up an Inference Endpoint, defined a distilabel pipeline that extracts information from a PDF, and generated and manually reviewed the instruction and preference datasets built from that input. The final preference dataset is ready for fine-tuning, and you can do this with the ArgillaTrainer from Argilla. Have a look at these resources if you want to go further:\n",
    "\n",
    "- [Train a Model with ArgillaTrainer](https://docs.argilla.io/en/latest/tutorials_and_integrations/tutorials/feedback/end2end_examples/train-model-006.html)\n",
    "- [Ⓜ️ Finetuning LLMs as chat assistants: Supervised Finetuning on Mistral 7B](https://docs.argilla.io/en/latest/tutorials_and_integrations/tutorials/feedback/training-llm-mistral-sft.html)\n",
    "- [🌠 Improving RAG by Optimizing Retrieval and Reranking Models](https://docs.argilla.io/en/latest/tutorials_and_integrations/tutorials/feedback/fine-tuning-sentencesimilarity-rag.html)\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
